AITopics | ui-t ar-1

GTA1: GUI Test-time Scaling Agent

Yang, Yan, Li, Dongxu, Dai, Yutong, Yang, Yuhao, Luo, Ziyang, Zhao, Zirui, Hu, Zhiyuan, Huang, Junzhe, Saha, Amrita, Chen, Zeyuan, Xu, Ran, Pan, Liyuan, Savarese, Silvio, Xiong, Caiming, Li, Junnan

arXiv.org Artificial Intelligence, Oct-7-2025

Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., choosing the action proposal sequence) under an expansive action space, where selecting an appropriate plan is non-trivial because many valid ones may exist; ii) accurately grounding actions in complex, high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates these challenges with our GUI Test-time Scaling Agent, GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled, then evaluated and selected by a judge model. This trades computation for better decision quality through concurrent sampling. Second, we propose a model that improves grounding of the selected action proposals to their corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignment, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks. The code and models are released.
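The two ideas in this abstract, judge-based selection among sampled proposals and a click-based reward, can be illustrated with a minimal Python sketch. It is not the authors' implementation: `Proposal`, `select_proposal`, and `click_reward` are hypothetical names, the judge is assumed to score each candidate independently, candidates are drawn sequentially rather than concurrently, and the reward is assumed to be binary on whether the click lands inside the target's bounding box.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# (left, top, right, bottom) in pixels; an assumed box convention.
BBox = Tuple[float, float, float, float]


@dataclass
class Proposal:
    """One candidate action proposal (hypothetical structure)."""
    description: str


def select_proposal(
    sample: Callable[[], Proposal],
    judge: Callable[[Proposal], float],
    num_candidates: int = 4,
) -> Proposal:
    """Sample several candidates and keep the one the judge scores highest.

    Spending more samples per step is the extra test-time computation.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [sample() for _ in range(num_candidates)]
    return max(candidates, key=judge)


def click_reward(click: Tuple[float, float], target: BBox) -> float:
    """Return 1.0 if the click lands inside the target box, else 0.0."""
    x, y = click
    left, top, right, bottom = target
    inside = left <= x <= right and top <= y <= bottom
    return 1.0 if inside else 0.0
```

In a real agent, `sample` would call a planner model and `judge` a separate evaluator model; here both are plain callables so the selection logic can be exercised in isolation.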

large language model, machine learning, natural language, (18 more...)

arXiv:2507.05791

Genre:

Workflow (1.00)
Research Report (1.00)
Overview (1.00)

Industry: Information Technology (0.70)

Technology:

Information Technology > Software (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

GUI-Spotlight: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding

Lei, Bin, Xu, Nuo, Payani, Ali, Hong, Mingyi, Liao, Chunhua, Cao, Yu, Ding, Caiwen

arXiv.org Artificial Intelligence, Oct-7-2025

Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples). Recent rapid advances in multimodal large language models have driven swift progress in GUI agents capable of handling complex tasks on general graphical user interfaces (GUIs) (Xie et al., 2024; Wu et al., 2024a). Nevertheless, current GUI agents still lack robust, fine-grained visual grounding, making it difficult to translate what to do into where to act on complex, dynamically changing screens (Jang et al., 2024; Xie et al., 2025).

arxiv preprint arxiv, large language model, natural language, (14 more...)

arXiv:2510.04039

Genre: Research Report (1.00)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)

GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

Huang, Kung-Hsiang, Qiu, Haoyi, Dai, Yutong, Xiong, Caiming, Wu, Chien-Sheng

arXiv.org Artificial Intelligence, Oct-2-2025

Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automate human-computer workflows. However, they face an efficiency challenge: they process long sequences of high-resolution screenshots and solve long-horizon tasks, which makes inference slow, costly and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal because they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
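The two scoring ideas named in this abstract can be sketched in plain Python on toy vectors. This is an illustration under stated assumptions, not the GUI-KV implementation: the weight `alpha`, the additive way the norm is combined with the attention score, the use of Gram-Schmidt to build the key subspace, and the definition of redundancy as the fraction of a key's norm captured by its projection are all assumptions, and every function name is hypothetical. The abstract does not say how the two scores are combined, so they are kept separate here.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def saliency_scores(attn: Sequence[float], hidden: Sequence[Vector],
                    alpha: float = 1.0) -> List[float]:
    """Attention score plus alpha times the L2 norm of each hidden state."""
    return [a + alpha * norm(h) for a, h in zip(attn, hidden)]


def keep_top(scores: Sequence[float], budget: int) -> List[int]:
    """Indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def orthonormal_basis(vectors: Sequence[Vector],
                      eps: float = 1e-9) -> List[List[float]]:
    """Gram-Schmidt basis for the span of `vectors`."""
    basis: List[List[float]] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = norm(w)
        if n > eps:  # skip vectors already inside the span
            basis.append([wi / n for wi in w])
    return basis


def redundancy_scores(prev_keys: Sequence[Vector],
                      cur_keys: Sequence[Vector]) -> List[float]:
    """Fraction of each previous key's norm lying in the current keys' span.

    A value near 1.0 means the key adds little beyond the current frame.
    """
    basis = orthonormal_basis(cur_keys)
    scores = []
    for k in prev_keys:
        n = norm(k)
        proj = math.sqrt(sum(dot(k, b) ** 2 for b in basis))
        scores.append(proj / n if n > 0 else 0.0)
    return scores
```

Under the uniform budget allocation described above, the same `budget` would be passed to `keep_top` for every transformer layer; previous-frame tokens with high redundancy scores would be the first candidates for pruning.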

machine learning, natural language, ui-t ar-1, (17 more...)

arXiv:2510.00536

Country:

Europe (0.68)
North America > United States > California (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.68)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Zhang, Miaosen, Xu, Ziqiang, Zhu, Jialiang, Dai, Qi, Qiu, Kai, Yang, Yifan, Luo, Chong, Chen, Tianyi, Wagle, Justin, Franklin, Tim, Guo, Baining

arXiv.org Artificial Intelligence, Aug-1-2025

With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/

arxiv preprint arxiv, large language model, machine learning, (20 more...)

arXiv:2507.23779

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Software (0.45)
Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(8 more...)

MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

Wang, Xuehui, Wu, Zhenyu, Xie, JingJing, Ding, Zichen, Yang, Bowen, Li, Zehao, Liu, Zhaoyang, Li, Qingyun, Dong, Xuan, Chen, Zhe, Wang, Weiyun, Zhao, Xiangyu, Chen, Jixuan, Duan, Haodong, Xie, Tianbao, Yang, Chenyu, Su, Shiqian, Yu, Yue, Huang, Yuan, Liu, Yiqian, Zhang, Xiao, Zhang, Yanting, Yue, Xiangyu, Su, Weijie, Zhu, Xizhou, Shen, Wei, Dai, Jifeng, Wang, Wenhai

arXiv.org Artificial Intelligence, Jul-28-2025

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing critical roles. More importantly, task efficiency remains a critically underexplored dimension: all models suffer from substantial inefficiencies, taking excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.

large language model, machine learning, natural language, (16 more...)

arXiv:2507.19478

Country: Asia > China (0.93)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology (0.46)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)
